Retrieving similar websites and web pages

Author

Jevon van Dijk

Abstract

Similar web pages are pages that cover the same topic and are of the same type. A site about a soccer club is related to all soccer websites, but similar only to the sites of other soccer clubs. The goal of this research is to find an approach that, given a page, finds similar pages based on textual content. The approach is twofold: the first task is to find similar websites, and the second is to find similar pages on those websites. Similarity is based on the textual content, and keyword extraction is used to identify the main topics. The tf*idf measure is used to select the best keywords, and the cosine similarity measure is used to calculate the similarity between two pages. The approach gives satisfactory results for finding related websites and similar pages, but finding similar websites proves difficult. The conclusion of this research is that it is hard to find similar websites based on textual content alone. Finally, some suggestions for further improvement are given.
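
The pipeline described in the abstract (tf*idf keyword weighting followed by cosine similarity between pages) can be illustrated with a minimal sketch. The abstract does not say which tf*idf variant, tokenizer, or keyword cutoff the author used, so everything here is an assumption made for illustration: the helper names `tokenize`, `tfidf_vectors`, `top_keywords`, and `cosine_similarity`, the plain `tf * log(N / df)` weighting, and the three toy page texts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def top_keywords(vector, k=5):
    """Return the k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy page texts (invented for this sketch, not data from the paper).
pages = [
    "Ajax is a soccer club from Amsterdam with a large stadium",
    "Feyenoord is a soccer club from Rotterdam with a large stadium",
    "Soccer news and transfer rumours from all leagues",
]
vectors = tfidf_vectors(pages)
print(top_keywords(vectors[0], k=2))
print(cosine_similarity(vectors[0], vectors[1]))  # club page vs. club page
print(cosine_similarity(vectors[0], vectors[2]))  # club page vs. news page
```

In this toy corpus the word "soccer" occurs on every page, so its idf is zero and it contributes nothing to the score. Under these assumptions the two club pages score higher against each other than either does against the news page, which mirrors the abstract's distinction between related and similar sites.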

Similar resources

SIGCSE: U: Focused Retrieval of University Course Descriptions from Highly Variable Sources

Finding topically relevant content from sparse disparate sources on the Web requires robust techniques. A focused web crawler is a type of crawler that attempts to make predictions about page relevance and traverse the web efficiently to retrieve relevant information. In this work, we design and test a novel framework of focused crawling tailored to extracting semantically relevant information ...

Webometrics of Persian Web Pages Related to Nutrition Based on the Silberg Criteria

Background and Aim: Considering the potential damages caused by inaccurate, inadequate and incomplete information published in web pages, the aim of this study was to evaluate Persian-language web pages containing nutritional information, using Silberg criteria. Materials and Methods: Internet pages related to nutrition were found in “peyvandha.ir” and by searching 20 nutrition-related keywo...

Structure-Based Web Pages Clustering

Recognizing similarities among the documents of a set is one of the objectives of retrieving information. The information related to the similarities of web pages can be used to present similar documents to users in order to retrieve considered information. In the present study, a new algorithm has been proposed to cluster web pages based on their structure. The proposed algorithm is based on h...

Conformity of the Metadata Elements of the Websites of Central Libraries of Medical Sciences Universities with Dublin Core Metadata Elements

Introduction: Considering the importance of library websites in the establishment of communication and provision of services for their users, it is crucial to include those features in these websites which can lead to increased dynamism and optimal communication. The present study aimed at comparing Metadata elements of Dublin Core with those of the websites of Central Libraries of Medical Univ...

An Improved Approach to perform Crawling and avoid Duplicate Web Pages

When a web search is performed, the results include many duplicate web pages or websites, meaning that a number of similar pages can be retrieved from different web servers. We propose a web crawling approach to detect and avoid duplicate or near-duplicate web pages. In this work we present a keyword-prioritization-based approach to identify a web page on the web. As such pages will be id...

Publication date: 2009
